(57) Abstract

A word breaker utilizing a lexicon module and a processing module to identify word
breaks in a stream of Asian (e.g. Japanese, Chinese, or Korean) language text. The
lexicon module is a dictionary or database containing words native to the language of
the input text. The processing module includes a plurality of analysis modules which
operate on the input text. In particular, the processing module can include modules
that analyze the input text using heuristic rules and statistical analysis to identify
a first set of word breaks, thereby reducing the size of segments with undefined word
breaks. The processing module also includes a database analysis module that identifies
the remaining undefined word breaks in those smaller segments that have undergone
heuristic or statistical analysis.
METHOD AND APPARATUS FOR BREAKING WORDS
IN A STREAM OF TEXT

FIELD OF THE INVENTION 
The present invention relates to automated language analysis systems,
and relates to such systems embodied in a computer for receiving digitally encoded text
composed in a natural language. More particularly, the invention relates to an efficient
and accurate method and apparatus for determining word breaks in digitally encoded
Asian language text.

BACKGROUND OF THE INVENTION 
Automated language analysis systems embedded in a computer typically include
a lexicon module and a processing module. The lexicon module is a table of lexical
information, such as a "dictionary" or database containing words native to the language
of the input text. The processing module includes a plurality of analysis modules which
operate upon the input text in order to process the text and generate a computer
understandable semantic representation of the natural language text. Automated natural
language analysis systems designed in this manner provide for an efficient language
analyzer capable of achieving great benefits in performing tasks such as information
retrieval.

Typically the processing of natural language text begins with the processing
module fetching a continuous stream of electronic text from an input module. The
processing module then decomposes the stream of natural language text into individual
words, sentences, and messages. For instance, individual words in the English language
can be identified by joining together a string of adjacent character codes between two
consecutive occurrences of a white space code (i.e. a space, tab, or carriage return).

Japanese language text, and other Asian languages such as Chinese and Korean,
cannot be separated into individual words as easily as English language text. Asian
language text typically includes a string of individual characters each separated by
white-space. Words in these Asian languages are formed of a single character or of
successive groups of characters, but the boundaries between the words are not explicitly
identified in the written text. The written text does not clearly indicate whether any
particular character forms a complete word or whether the particular character is only
part of a word. In addition, the written characters may be from one or more character
alphabets. For example, Japanese words may be formed from any of four character types:
Katakana, Hiragana, Kanji, and Romaji characters. Identifying these ambiguous word
boundaries between the characters proves important in electronically translating or
processing Asian language documents.

Some prior art systems attempt to determine these word boundaries with simple
pattern matching rules while other prior art systems resort to using a database of Asian
language words to identify word breaks in Asian language text. For instance, US Patent
No. 5,029,084, issued to Morohasi et al., discloses a system that combines various
pattern matching approaches to determine word boundaries in the text. The Morohasi
system identifies character divisions based on character type definitions (i.e., Katakana,
Kanji, Hiragana) and then processes the sentence by comparing the characters to a
content word dictionary containing Japanese words. For any character segments
remaining after this initial processing, a series of compound word synthesizing rules are
used to determine the division of the remaining segments. This system has the
drawback of performing a costly up-front comparison of the characters in the
stream of text with a content word dictionary of the Japanese language.

Other prior art systems use morpheme analysis to determine the word breaks in a 
Japanese language sentence. US Patent No. 5,268,840, issued to Chang et al., describes 
a method and apparatus for morphologizing text. The Chang system discloses 
segmenting the input text of characters into the longest morphemes that can be formed 
from the input text. This is achieved by forming the longest morpheme from the 
remaining characters in the sentence which is listed in a dictionary of valid morphemes 
and determining if it is conjunctive with the previously divided morpheme. The
conjunctiveness of successive morphemes can be based upon grammar rules that require 
two adjacent morphemes to obey certain rules of connection. 

Morphological analyzers of the type disclosed in Chang have efficiency 
problems. For example, subsequent identification of morphemes beyond the initially 
identified morpheme may indicate that the earlier identified morphemes are incorrect 
and require further analysis. This inherently recursive nature of the system causes
inefficiencies in the processing of the input text. In addition, the morphological analysis
of Chang requires two separate processing steps. In the first step, the system identifies 
the morphemes themselves and in the second step the system requires the application of 
the morphological rules to the entire document. Thus, a morphological analysis system 
typically requires considerable computer processing effort and frequent database 
accessing resulting in longer processing times, coupled with the ever present risk of
needing to review and reassess earlier faulty analysis. 

Accordingly, an object of the invention is to provide a word breaker that 
efficiently and accurately identifies word breaks in a stream of Asian language text. 

Other general and specific objects of the invention will be apparent and evident 
from the accompanying drawings and the following description. 

SUMMARY OF THE INVENTION 

The invention provides an apparatus and method for identifying word breaks in 
Asian language text that overcomes the limitations in the above described techniques. In 
particular, the present invention achieves accuracy and efficiency by applying 
computationally expensive procedures to word segments having unidentified word 
breaks only after less computationally expensive procedures have been exploited to 
reduce the size of the word segments by identifying word breaks. The invention
discloses an apparatus that first analyzes the stream of text with computationally 
inexpensive processing, thereby reducing the size of the segments with undefined word 
breaks within the stream of text. The inventive system then analyzes these smaller 
segments with more computationally expensive processing in a manner that does not 
require reexamination of the earlier analysis steps. 

For instance, one aspect of the word breaker includes an element for storing the 
input character string, a first memory element for storing a character-transition table, a 
second memory element for storing a dictionary, a statistical analysis module for 
reducing the number of unidentified word breaks in the stream of text, and a database 
analysis processor for locating the remaining unidentified word breaks in the stream of
text. The statistical analysis module, a less computationally expensive process than a 
dictionary look-up, examines a first segment in the stream of text to locate a first word 
break. The statistical analysis module, using computationally inexpensive processes, 
then partitions the stream of text into at least a first sub-segment and a second sub- 
segment divided by the first word break. The first and second sub-segments are then 
analyzed using the more computationally expensive database analysis processor to 
identify the remaining word breaks in the first segment. The database analysis processor 
identifies the remaining word breaks by comparing the characters in the first and second 
sub-segments with entries in the dictionary stored in the second memory element. 

The invention further provides for a statistical analysis processor that identifies 
the first word break with the aid of data stored in the character-transition table. For 
instance, the statistical processor can identify the first word break as a function of a
statistical morpheme stored in the character-transition table. In one instance, the
statistical processor can compare the characters in the first segment of the input stream
with the entries in the character-transition table, and the statistical processor can then
identify those characters in the stream of text that match entries in the character-
transition table. Once a match is found, the statistical processor can then associate a
character-transition tag with the matched characters in the stream of text. The associated
character-transition tag identifies the existence of a concatenation between successive
characters, a break between successive characters, or an unknown transition between
successive characters.

Additional features of the statistical analysis processor provide for a windowing 
module for forming a window of successive characters from the stream of text. The
windowing module also includes structure for comparing the window of successive 
characters with entries in the character-transition table. This allows a window of 
characters to be compared with entries in the character-transition table and to identify 
whether the window of characters forms a statistical morpheme as registered in the 
character-transition table. 

Further in accordance with the windowing module, the invention can further 
include a processing element for sliding the window of successive characters along the 
first segment of the stream of text. This provides for a system that can compare 
successive groupings of characters within the stream of text with the entries in the 
character-transition table. Additionally, the windowing module can include a processing 
element that controls the number of characters in the window, that is the size of the 
sliding window. This allows various numbers of successive characters to be compared
with the entries in the character-transition table.
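
As an illustration of the windowing module described above, the following minimal Python sketch slides windows of several sizes along a segment and reports the windows that appear in a character-transition table. The function names, the window sizes, and the table contents are assumptions made for this example; the specification does not prescribe them.

```python
def sliding_windows(segment, sizes=(2, 3)):
    """Yield (start, window) for every window of each requested size."""
    for size in sizes:
        # A window of `size` characters fits at len(segment) - size + 1 positions
        for start in range(len(segment) - size + 1):
            yield start, segment[start:start + size]


# Hypothetical table contents; a real table would hold statistical morphemes.
TRANSITION_TABLE = {"です", "ました"}


def find_matches(segment, table, sizes=(2, 3)):
    """Return the (start, window) pairs whose window is registered in the table."""
    return [(start, w) for start, w in sliding_windows(segment, sizes) if w in table]


matches = find_matches("行きました", TRANSITION_TABLE)
```

Passing a different `sizes` tuple changes the number of characters per window, which corresponds to the processing element that controls the size of the sliding window.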

Another aspect of the invention includes a heuristic rule table and a heuristic 
rule module for further reducing the number of possible character combinations forming 
words in the input stream of text. The heuristic rule module and its associated heuristic
rule table are typically computationally inexpensive procedures that are applied before
the database analysis processor is utilized, thereby reducing the size of word segments
having unidentified word breaks that must be processed by the database analysis
processor. The heuristic rule table and heuristic rule module can identify word breaks in
the input character string; for Japanese, these are based upon numbers, punctuation,
Roman letters, classifiers, particles, honorific prefixes, emperor years, and Kanji-
Katakana character-transitions.

Other features of the invention include a word verification module for verifying
matches between identified words in the input character string, and a character-transition
table that includes character strings of morphemes that form a minimum spanning set
necessary to identify character-transitions in the input character string. Additionally, the
entries in the character-transition table can be of variable length.

One preferred method of the invention for locating unidentified word breaks in an
input character string includes the steps of: storing the input character string, identifying
a morpheme in a first segment of the input character string, reducing the number of
unidentified word breaks in the input character string based upon the identified
morpheme, and locating the remaining unidentified word breaks in the input character
string by comparing the segments of the input string with a dictionary. This method
employs computationally expensive procedures only after less expensive computational
procedures have been employed to reduce the size of the segments having unidentified
word breaks. For instance, the inventive method first identifies a word break in the
input character string based upon a morpheme; the identified word break divides the
input string into first and second sub-segments having the remaining unidentified word
breaks. The first and second sub-segments are then analyzed using the computationally
more expensive process of comparing the sub-segments with a dictionary.
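
The two-stage method can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the morpheme is assumed to be given, a break is placed on both sides of it, and the dictionary comparison is done with a greedy longest-match lookup, a technique chosen here for brevity because the text above does not specify the lookup strategy. All names are hypothetical.

```python
def split_on_morpheme(text, morpheme):
    """Stage 1 (cheap): break before and after the first occurrence of a morpheme."""
    i = text.find(morpheme)
    if i < 0:
        return [text]
    parts = [text[:i], morpheme, text[i + len(morpheme):]]
    return [p for p in parts if p]  # drop empty sub-segments


def longest_match(segment, dictionary):
    """Stage 2 (expensive): greedy longest-match dictionary lookup on one sub-segment."""
    words, pos = [], 0
    while pos < len(segment):
        for end in range(len(segment), pos, -1):
            # Fall back to a single character when no dictionary entry matches
            if segment[pos:end] in dictionary or end == pos + 1:
                words.append(segment[pos:end])
                pos = end
                break
    return words


def break_words(text, morpheme, dictionary):
    """Run the dictionary lookup only on the sub-segments left by stage 1."""
    words = []
    for sub in split_on_morpheme(text, morpheme):
        words.extend([sub] if sub == morpheme else longest_match(sub, dictionary))
    return words


result = break_words("私は学生です", "です", {"私", "は", "学生"})
```

Because the dictionary lookup only ever sees the shorter sub-segments, the number of candidate substrings it must test shrinks with every break found in the first stage.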

An additional aspect of the invention provides for a machine readable storage
medium having instructions stored thereon that direct a programmable computer to
implement the word breaker disclosed herein. The instructions on the machine readable
storage medium implement a first element for reducing the number of unidentified word
breaks in a character string by locating a first word break in a first segment of the
character string as a function of at least one statistical morpheme in the first segment, the
first word break dividing the first segment into a first sub-segment and a second sub-
segment, and a second element for locating substantially all of the remaining
unidentified word breaks in the first and second sub-segments by comparing the first and
second sub-segments with entries in a dictionary of lexical entries.


WO 98/08169                                                        PCT/US97/14741


BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of a word breaker according to the present invention;

FIG. 2 is a flow chart illustrating the processing steps of the heuristic rule
analysis module of FIG. 1;

FIG. 3 is a flow chart illustrating the processing steps of the statistical analysis
module of FIG. 1;

FIG. 4 is a detailed flowchart showing the steps of the database verification
analysis of FIG. 1;

FIG. 5 is a flowchart illustrating the steps of the database analysis of FIG. 1; and

FIG. 6 is a detailed flowchart illustrating the comparison step of the database
analysis procedure of FIG. 5.

DETAILED DESCRIPTION

FIGURE 1 illustrates a word breaker 5 in accordance with the invention.
The word breaker 5 comprises a processing module operably coupled with a lexicon
module. The processing module can include a statistical analysis module 20, a database
analysis module 24, an optional heuristic rule analysis module 12, and an optional
database verification module 22. The separate sub-modules 12, 20, 22 and 24 of the
processing module are operably coupled together to allow the transfer of data and
control signals between the sub-modules. The lexicon module can include a character-
transition table 16, a dictionary 18 with lexical entries, and an optional heuristic rule
table 14. The character-transition table 16 is operably coupled with the statistical
analysis module 20, the dictionary 18 is operably coupled with the database verification
module 22 and with the database analysis module 24, and the heuristic rule table 14 is
operably coupled with the heuristic rule analysis module 12. As further illustrated in
FIG. 1, the word breaker 5 can include an input module 10 for receiving the stream of
input text and an output module 26 for generating an output signal representative of the
stream of input text with identified word breaks.

The word breaker 5 can be implemented using electronic hardware devices,
software, or a combination of the two. For example, a digital computer running a UNIX
platform can be loaded with software instructions to implement the structure and
processes of the word breaker 5. The input module 10 can be a keyboard, a text
processor, a machine readable storage device, or any other structure capable of
generating or transferring a stream of text. The processing module and its sub-modules
12, 20, 22 and 24 can be implemented through software instructions executed by the
processor in the digital computer or by specifically designed hardware components
implementing the equivalent instruction set. The lexicon module and its sub-modules 14,
16 and 18 can be implemented by tables stored in a volatile or a non-volatile memory
device that is operably coupled with the processing module. The output module can be
implemented using any device capable of storing, displaying, or transferring the signals
generated by the processing module of the invention. For instance, the output module
can include a SCSI hard drive, electronic memory devices, or a video monitor.

In operation, the input module 10 either receives or generates the input stream of 
text that requires identification of word boundaries. The input module 10 can either pre- 
process the text or it can directly transfer the stream of incoming text to the heuristic rule 
analysis module 12. For instance, the stream of Japanese text entering the input module
10 can be represented electronically using JIS (Japanese Industry Standard), Shift-JIS,
EUC (Extended Unix Code) or Unicode, wherein each Japanese character is represented
by two bytes, but the input data can contain single-byte or double-byte characters. It is
preferable to have the Japanese text represented in one standard format such as
Unicode. Accordingly, the input module 10 advantageously provides pre-processing to
convert the input stream of text into a standard format, such as Unicode.
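
A minimal sketch of such a pre-processing step is shown below, using the codecs shipped with the Python standard library. The mapping of "JIS" to the `iso2022_jp` codec is an assumption made for this example, and the function name is hypothetical; the source encoding is assumed to be known to the caller rather than detected.

```python
# Codec names are those of the Python standard library.
SUPPORTED = {
    "JIS": "iso2022_jp",       # assumption: JIS here means ISO-2022-JP
    "Shift-JIS": "shift_jis",
    "EUC": "euc_jp",
    "Unicode": "utf-16",
}


def to_unicode(raw, source_format):
    """Decode raw input bytes into a Python str (a sequence of Unicode code points)."""
    return raw.decode(SUPPORTED[source_format])


sample = "日本語 text"
decoded = [to_unicode(sample.encode(codec), name) for name, codec in SUPPORTED.items()]

# Mixed single-byte and double-byte input: in Shift-JIS a Kanji takes two bytes
# while an ASCII letter takes one.
kanji_bytes = len("日".encode("shift_jis"))
ascii_bytes = len("a".encode("shift_jis"))
```

After decoding, every later stage can index the text by character instead of tracking whether a given byte starts a single-byte or a double-byte character.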

The input module 10 can also associate tags with each character in the input
stream of text. The tags associated with each electronic equivalent of the characters
identify attributes of the characters. In particular, a character-transition tag can be
associated with each character such that the character-transition tag identifies word
breaks between characters in the input stream of text. For instance, in the present
invention, character-transitions can be represented as a tri-state value with a 0
representing an unknown transition, a 1 representing a break between the characters,
and a 2 representing that there is a link, i.e., a concatenation, between the characters.
Each transition is represented by a two-bit flag and accordingly the bit values for
UNKNOWN, BREAK, and LINK are 00, 01, and 10 respectively. For an "N" character
pattern, there are "N+1" of these bit pairs which are necessary to completely represent
the transitions. For example:
      the pattern |A_B?C|;
      wherein: "|" indicates a break,
               "_" indicates a link, and
               "?" indicates an unknown transition;
can have character-transition tags represented as: 01 10 00 01
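
The two-bit encoding can be checked with a short Python sketch. The helper names are hypothetical, and the pattern notation is assumed to alternate strictly between one transition symbol and one character, as in the example above.

```python
UNKNOWN, BREAK, LINK = 0b00, 0b01, 0b10
SYMBOLS = {"?": UNKNOWN, "|": BREAK, "_": LINK}


def encode_pattern(pattern):
    """Return the N+1 transition flags of a pattern such as '|A_B?C|'."""
    # Transition symbols sit at even positions, characters at odd positions
    return [SYMBOLS[symbol] for symbol in pattern[0::2]]


def to_bits(flags):
    """Render each flag as a two-bit pair."""
    return " ".join(format(flag, "02b") for flag in flags)


flags = encode_pattern("|A_B?C|")
print(to_bits(flags))  # 01 10 00 01
```

The three characters A, B and C yield four flags, one for each gap including the gaps before the first and after the last character.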

The optional heuristic rule module 12 receives the stream of text from the input 
module 10 and transmits a heuristically processed stream of text to the statistical 
analysis module 20. Alternatively, the heuristic module can be removed from the word
breaker 5 such that the stream of text output by the input module 10 passes directly to 
the statistical analysis module 20. 

In general, the heuristic rule analysis module 12 identifies a character-transition
in the character stream such that the number of possible character combinations forming
words in the character string is reduced. For example, the heuristic rule module can
identify character-transitions based on a classification of the character type, wherein the
character-transitions identify either word breaks between successive characters, a link
between successive characters, or an unknown transition between successive characters.
The heuristic rule module 12 acts in cooperation with the heuristic rule table 14 to
identify the character-transitions in the character stream.
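
One way to picture a rule based on character type is the sketch below, which marks a break wherever a Kanji character meets a Katakana character and leaves every other interior transition unknown. The classification by Unicode block (Hiragana U+3040 to U+309F, Katakana U+30A0 to U+30FF, CJK Unified Ideographs U+4E00 to U+9FFF) and the function names are assumptions made for this example; the rule table of the invention covers more cases than this single rule.

```python
UNKNOWN, BREAK, LINK = 0, 1, 2


def char_type(ch):
    """Classify a character by its Unicode block (a simplification)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def kanji_katakana_breaks(text):
    """Return len(text) + 1 tags; only Kanji/Katakana boundaries are decided."""
    tags = [UNKNOWN] * (len(text) + 1)
    tags[0] = tags[-1] = BREAK  # the ends of the input are always breaks
    for i in range(1, len(text)):
        if {char_type(text[i - 1]), char_type(text[i])} == {"kanji", "katakana"}:
            tags[i] = BREAK
    return tags


tags = kanji_katakana_breaks("東京タワー")
```

Here the only interior break falls between 京 and タ; the remaining interior transitions stay unknown for the later stages.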

Once the character-transitions are identified, the heuristic rule module sets the
character-transition tags associated with each identified character-transition to
either a break between characters or a link between characters. The heuristic rule
module 12 then passes all character-transition tag data to the statistical analysis module 20.
The statistical analysis module reads the character-transition tag data and analyzes only
those segments having unknown interior character-transitions.

The statistical analysis module 20 either receives a stream of data from the 
optional heuristic rule analysis module 12 or from the input module 10. The statistical 
analysis module reduces the number of unidentified word breaks in the character string
based upon a statistical morpheme in the character string. A statistical morpheme is a
morpheme that is identified based upon a statistical analysis of the frequency of
occurrence of the morpheme in a corpus of text. The statistical analysis module
cooperates with the character-transition table 16 to identify the statistical morphemes in 
the input character stream. 

In operation, the statistical analysis module 20 analyzes all remaining unknown 
character-transitions in the input text by comparing a sequence of characters containing
unknown character-transitions in the input text to sequences of characters stored in the 
character-transition table 16. The character-transition table 16 contains entries of 
character segments, i.e. statistical morphemes, which are chosen to statistically predict a 
character-transition based on the order of characters. Accordingly, a character-transition 
tag associated with the input character stream can be set when a sequence of characters 
in the input text matches an entry in the character-transition table. In those aspects of the
invention wherein the word breaker 5 utilizes both the heuristic analysis module 12 and 
statistical analysis module 20, it is estimated that approximately 90% of the character- 
transitions in the input character stream are accurately identified. 
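
The table lookup performed by the statistical analysis module might be sketched as follows. The layout of a table entry (a character sequence mapped to one predicted transition per gap) and the entries themselves are hypothetical and chosen only to make the example concrete; the specification states only that entries are character segments chosen to predict a transition.

```python
UNKNOWN, BREAK, LINK = 0, 1, 2

# Hypothetical entries: character sequence -> predicted transitions, one per gap,
# including the gaps before and after the sequence. UNKNOWN means "no prediction".
TABLE = {
    "ました": [BREAK, LINK, LINK, UNKNOWN],
    "を": [BREAK, BREAK],
}


def apply_table(text, tags, table):
    """Fill in unknown transitions wherever a table entry occurs in the text."""
    tags = list(tags)
    for sequence, predicted in table.items():
        start = text.find(sequence)
        while start != -1:
            for offset, value in enumerate(predicted):
                # Never overwrite a transition that an earlier stage already set
                if tags[start + offset] == UNKNOWN and value != UNKNOWN:
                    tags[start + offset] = value
            start = text.find(sequence, start + 1)
    return tags


text = "本を読みました"
initial = [BREAK] + [UNKNOWN] * (len(text) - 1) + [BREAK]
tagged = apply_table(text, initial, TABLE)
```

In this run a single transition, the one between 読 and み, remains unknown and would be left to the dictionary-based modules.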

After processing of the stream of text by the statistical analysis module 20, the 
input character stream can be either processed by the database verification module 22, 
the database analysis module 24, or directly by the output module 26. For instance, 
when the input character stream has no remaining unidentified character-transitions after 
processing by the statistical analysis module 20, processing proceeds directly to the
output module 26. 

The database verification module 22 is an optional module that compares the 
input text segments identified as words by the character breaks, to lexical entries 
contained in the dictionary 18. If the word is correct and properly verified it is passed to
the output module. After processing by the verification module, which includes 
adjustments or corrections of spurious word forms that remain after statistical analysis, 
approximately 95% of the character-transitions are accurately identified. The remaining 
unknown input text segments with character-transitions are passed to the database 
analysis module 24. The database analysis module 24 compares the remaining unknown 
character-transitions to the lexical entries in the dictionary 18 to determine the final 
word breaks. The text, now with spaces between the words, is passed to the output 
module 26. 
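
The hand-off between the verification module, the database analysis module, and the output module can be illustrated with the sketch below. It is a simplification under stated assumptions: verification is reduced to a dictionary membership test, and the remaining segments are split with a fewest-words dynamic-programming search, a technique chosen for this example and not taken from the specification. All names are hypothetical.

```python
def segment(seg, dictionary):
    """Fewest-words segmentation; single characters are allowed as a fallback."""
    best = [[]] + [None] * len(seg)  # best[i] is the best split of seg[:i]
    for end in range(1, len(seg) + 1):
        for start in range(end):
            piece = seg[start:end]
            if piece in dictionary or len(piece) == 1:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(seg)]


def finalize(segments, dictionary):
    """Pass verified words through; send everything else to dictionary analysis."""
    words = []
    for seg in segments:
        if seg in dictionary:          # verification succeeds
            words.append(seg)
        else:                          # remaining unknown transitions
            words.extend(segment(seg, dictionary))
    return " ".join(words)             # text with spaces between the words


LEXICON = {"私", "へ", "東京", "東京都", "都庁", "京都"}
output = finalize(["私", "東京都庁", "へ"], LEXICON)
```

Only the middle segment reaches the more expensive search; the two verified words are passed through unchanged.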

FIG. 2 illustrates a flow diagram for implementing the heuristic rule analysis 
module 12. The heuristic rule module in cooperation with a heuristic rule table identifies 
characters in the text that consistently identify character-transitions, i.e., a break between 
characters or a link between characters. For example, in one embodiment of the present 
invention the heuristic rule module identifies character-transitions formed by numbers, 
Roman letters, punctuation marks, classifiers, particle delimiters, repeat characters, 
honorific prefixes, Kana/Kanji combinations, and end of sentence markers. 

After a character-transition is identified, the heuristic rule module sets a
character-transition tag associated with each character to either a break or a link. Once a
character-transition tag is set, the character-transition tag need not be modified by any
further processing because of the accuracy of the processing used to set the character-
transition tags. The results from the heuristic rule analysis module 12 may be one of
three types: 1) an identified word; 2) a segment with unknown interior character-
transitions; and 3) an unidentified segment. An identified word is one in which there is
a break before the first character and after the last character with all character-transitions
set as a link in the interior of the identified word. A segment with unknown character-
transitions will have a break before the first character and after the last character, but
will have at least one unknown character-transition in the interior of the segment. An
unidentified segment is a succession of characters having [...]

[...] module 12 for processing. After step 29, step 30 is executed.

Step 30 analyzes the input stream of text with a set of heuristic rules for [...]

All forms of numbers, single-byte and double-byte Arabic [...] heuristic rule
module 12. In addition, numbers followed by [...] (Greek, or Cyrillic) as one word
segment. [...]

The heuristic rule module also places word breaks [...] and points used in numbers
will not have a [...] comma or decimal point. Table 1 is a representative table that
identifies [...] characters in the stream of text that are to be classified as [...]
Table 1.
Punctuation Marks

[d = double-byte; s = single-byte]

[The symbol column of Table 1 is not recoverable. The table lists single-byte and
double-byte marks under the following headings.]

Phrasal marks (including the Japanese comma)
Range marks (including the double-byte minus sign)
Quotation marks
Brackets and parentheses
Other symbols/punctuation marks (including the degree, degree centigrade, minutes,
   seconds, and zip code symbols, the number/phrase repeat sign, and a symbol which
   replaces [...])
Bullets

Table 2 represents a list of classifiers, both general and specific classifiers (such
as time and currency) used by the heuristic rule module 12 to identify character-
transitions. In general a classifier will follow the number expression it is classifying.
Only those classifiers identified in Table 2 as "currency classifiers Part 2" precede the
number expression. Any time the number is preceded or followed by a classifier the
number expression includes the classifier as part of the word. That is, there should be no
break between the number and the classifier.
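
The classifier rule can be sketched with a regular expression that keeps a number and its adjacent classifier together as one word. The classifier subsets below, the restriction to single-byte digits, and the treatment of commas and decimal points inside a number are assumptions made for this example; the pattern and function names are hypothetical.

```python
import re

# Small hypothetical subsets of the classifier lists.
FOLLOWING = ("時", "分", "日", "月", "年", "秒", "円")  # follow the number
PRECEDING = ("$", "¥")                                  # precede the number

_PREFIX = "|".join(re.escape(c) for c in PRECEDING)
_SUFFIX = "|".join(re.escape(c) for c in FOLLOWING)

# optional preceding classifier, digits with optional , or . groups,
# optional following classifier
NUMBER_EXPRESSION = re.compile(
    rf"(?:{_PREFIX})?[0-9]+(?:[.,][0-9]+)*(?:{_SUFFIX})?"
)


def number_expressions(text):
    """Return each number together with its classifier as a single word."""
    return [m.group(0) for m in NUMBER_EXPRESSION.finditer(text)]


found = number_expressions("3時に$20を払う")
```

Each match would then receive link tags in its interior and break tags at its two ends.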
Table 2.

Classifiers

General Classifiers

[The general classifier entries (characters and their Unicode code points) are not
recoverable.]

Note: 6761 has been removed as of 7/29/96.

Time/Date Classifiers: (These always FOLLOW numbers)

   Code point(s)          Meaning
   6642                   Hour
   6642.9593              Half Hour
   5206                   Minute
   65E5                   Day
   6708                   Month
   5E74                   Year
   79D2                   Second

Currency Classifiers Part 1: (These FOLLOW the number)

   Code point(s)          Currency
   5186                   Yen
   5f17                   Dollar
   30c9.30eb              Dollar
   3326                   Dollar
   30d5.30e9.30f3         Franc
   30de.30eb.30af         Mark
   30da.30bd              Peso
   5143                   Yuan
   30a6.30aa.30f3         Won
   5713                   Yuan
   30eb.30fc.30d6.30eb    Ruble
   30ea.30e9              Lira
   3306                   Won (Korea)                         (not supported by Shift-JIS)
   3307                   Escudo (Portugal)                   (not supported by Shift-JIS)
   3313                   Guilders                            (not supported by Shift-JIS)
   331a                   New Cruzeiro (Brazil)               (not supported by Shift-JIS)
   331b                   Krone (Generic)                     (not supported by Shift-JIS)
   331d                   Koruna (Czech Republic)             (not supported by Shift-JIS)
   3321                   Schilling (Austria)                 (not supported by Shift-JIS)
   3340                   Pound (Great Britain, etc.)         (not supported by Shift-JIS)
   3350                   Yuan (People's Republic of China)   (not supported by Shift-JIS)
   3352                   Lira (Italy, San Marino, Vatican)   (not supported by Shift-JIS)
   3353                   Rupee (India)                       (not supported by Shift-JIS)
   3354                   Squared Ruble                       (not supported by Shift-JIS)
   3335                   Squared Franc                       (not supported by Shift-JIS)
   3337                   Squared Peso                        (not supported by Shift-JIS)
   3338                   Squared Penih                       (not supported by Shift-JIS)
   333a                   Squared Pence                       (not supported by Shift-JIS)
   3346                   Squared Mark                        (not supported by Shift-JIS)

Currency Classifiers Part 2: (These PRECEDE number)

   Symbol   Code point   Currency
   $        0024         USD
   ¥        00A5         JPY
   ￥       FFE5         JPY
   ＄       FF04         USD
   ￡       FFE1         [...]

   ([...] single-byte YEN 00A5 [...] not available in Shift-JIS)

Table 3 represents a list of particle delimiters used by the heuristic rule analysis
module 12 to identify further character-transitions in the stream of text. The right hand
column lists rules specific to the particle or function shown in the left hand column. The
rules listed in the right hand column of Table 3 are translated according to the following
code: BB = break immediately before the particle; BA = break immediately after the
particle; K[n] = kanji; T = katakana; H = hiragana; and Pnc = punctuation mark.
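
A rule written in this code can be applied mechanically, as the following sketch suggests. The rule used in the example ("BB BA if preceded by K && followed by T or Pnc" for the particle は) is a hypothetical instance in the style of Table 3, and the character classification by Unicode block and general category is an assumption of this example.

```python
import unicodedata

UNKNOWN, BREAK, LINK = 0, 1, 2


def char_class(ch):
    """Map a character to K, T, H or Pnc; '?' for anything else."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:
        return "K"
    if 0x30A0 <= cp <= 0x30FF:
        return "T"
    if 0x3040 <= cp <= 0x309F:
        return "H"
    if unicodedata.category(ch).startswith("P"):
        return "Pnc"
    return "?"


def apply_rule(text, tags, particle, actions, before, after):
    """Set BB/BA breaks around each occurrence of the particle that meets the condition."""
    tags = list(tags)
    pos = text.find(particle)
    while pos != -1:
        end = pos + len(particle)
        prev_ok = pos > 0 and char_class(text[pos - 1]) in before
        next_ok = end < len(text) and char_class(text[end]) in after
        if prev_ok and next_ok:
            if "BB" in actions:
                tags[pos] = BREAK   # break immediately before the particle
            if "BA" in actions:
                tags[end] = BREAK   # break immediately after the particle
        pos = text.find(particle, pos + 1)
    return tags


text = "東京はタワー"
start_tags = [BREAK] + [UNKNOWN] * (len(text) - 1) + [BREAK]
result = apply_rule(text, start_tags, "は", {"BB", "BA"}, before={"K"}, after={"T", "Pnc"})
```

Conditions that consult an exception list, such as the word tables named in some rules, would add one more membership test to the condition.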
Table 3. Particle/Function Word Delimiters

3.1. Particle/Function Word Delimiters for Word-breaking

[The function words and romanizations of the first group of entries are not
recoverable. Legible code point sequences in this group include 4e43.81f3, 6216.306f,
5373.3061, and 6216.3044.306f. Each entry in the group carries the rule BB BA; one
entry carries the rule "BB BA if not followed by [...]".]

304c
BB BA if preceded by Kl LL followed bv K2 "K; K>" i = 
XOTingaword.tbl 

BB BA if preceded by K fit followed by T or Pnc ku K is NOT 
any of {0x6211 S?55 

0\5-43e4:g 

OxGbGf 

OxseST irg ] 
BB BA if preceded by T ££: followed by K c: T or ?-: 
BB BAif preceded by Pnc LL followed by K or T 
BA if preceded by H Lb followed by T 

BAif preceded by H ££: followed bv K LL K p \OT i- 
gasuffnctbl -.. 

BB if preceded bvT&fc followed bv H 



30Gf 



BB BAif preceded ££: followed by K 

BB BAif preceded LL followed by T 

BB BAif preceded by T LL followed by u or Pnc 

BB BA of [receded bu L LSt followed bv T cr ?p.c £ 
any of { 0xG2aG^£i 
0\53cS 



i* not 
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Table 3 . (cont.) 



3078 



tjy f n °J 30Cc 



^ l30Gb.30G6 



Idake] 30G0,305i 



fkosol I3053J05I4 
f r :> [sural hn<n_ 
^ b'"J 13084 



*V [nan) 



30G7 



BA if followed by K LL K is NOT |0x666b t± j 
BA if followed by T 
BB if preceded by T 

BB if p^ed by K &£ K is NOT any of, 0 S G2 a6 * S 

0x53cS *3? 

BA if preceded bv H followed by T 
BB if precede hy T fcfcf n ll 

BA if prec eded fcfc loliou-ed bv K K ^ r - ,■ vn- — 
noword.tbl i^K liXOi m 

BB BA if preceded by K && followed by T or Pnc 
BB BA if preceded by T followed by K or Pnc or T 
BA if preceded by H £:£ fallowed by T 




BA preceded hv H ^ foi, B ^ »,:;... T 
BB BA „ preceded by k 4:4: iollow.d by K or T or Pnc 



BB BAif preceded hv T && f ollnw ..„ k- ... « ... r „ 
BAif preteued by k 4c4: followed by K or T 

BB BA if preceded by T LL followed K or T or Pnc 



BB BAif prec eded hv M\ oxv?6 bv T 

0.\79cO.S?5 

0x59Jf.*S5 
OxGcSS. 
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Table3.(cont.) 



Ox982d. 
0x64ab,*Ji 
0x788c. #63 
0x8339, £55 } 
&&K2 isK-OTanvof 
{0x6eGf,*>l 
0.\7acb, *n 
0x5207,StD 
0x7121.#£ 
0x73S9.#£ 
0x7db2.*S! 
0x53e3.2Q 
0x5c71,wtl) 

0x77f3.*5 
0x76e4 ( *'J3 
0x540S.#© 
0x9eba.*E } 

BB BA if preceded followed by T 

BB BA if preceded by T and followed by X or Pnc 

BB if preceded by T 

BB if preceded by K which is NOT any of { 0x6c5a. 

Ox79cO.^ 
0x5113, -tt 
Ox39-Jf. £g 
OxGcSS. 
0x9S2d. S£S 
OxG-Ub, £j= 
CxTSSc. #55 

„ 0.\S339 ( -GO ] 

BA if followed by T 

BA if followed by K which is NOT any of ( OxGeGf. £jg 

Ox7acb. -n 
0x5207.*tD 
C\\7121.*S= 
Cx73S9.*i 
Ox7db2,#Sl 
0\53c3.-D 
0x5c71.tQj 

0\77f3*5 
0.\7Ge4.*"S 
0x5-J0S *g 
OxOcba I 
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Table3. (cont) 



\t Ito] 



30G8 



»'0> JkaraJ | 3 04b 



30SD 



NOT,„ ■owfaAl « Ia j, K0T °' 6 T»P~ 44 KI i 

0x"lc3 #|« 
OxGllb #g 
0x77/3 #5 
Oxoea? 
Ox9S03 «tl 

BB if preceded by T ° x7Gae * s > 

BBifpreceriedbyK^K.N'OT.topren^bi 
BA if followed by T 

BA if followed bv K LL \" i- N-rvr 

. ^ .v i, NOT any of ( OxSffd 

0x7 ic3 SjS 

OxGllb £g 

Ox77f3 ^5 

Ox Sea 7 £g? 

0x9S03 SrJ 
e 

orPnc" 



IS: 



BB BA if preceded bv 



* or Tor Pnc tolled by T or Pr 



BB BA if preceded bv Tor p n ^ n ni 

kamuttbl && f ° llou " ed b >* K K is NOT in, 

BB BA if preceded by Ki followed bv j-o r. f .- ri vnT 

0x4edSSft • ^ Ki 15 any of 

0x5132 «CI 
0x5208 
0x52a9 
0x61/8 #£* 
0x639b 
0x67b6#S 
OxSleaftg 
0x9060*8 

0X9510 *»» *tK2NOTink.„, uf . lbl 



BB if preceded by T 



BB if preceded by K *fi K is NOT any of ( 0x4ed8 * M 

0x5132 ^Cc 
0x5206 S« 
0x52a9^Sfe 
OxGlfSS** 
0\639b SiS 
OxGTbC *K 
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Table 3. (cont) 







OxSleaSg 






0x9060 






. OxDS 10 ^fj| } 
B A if followed by T 






BA if followed bv K LL K is NOT in karasufthl 


:z inij 


30Gb 


BA if followed byT 

BA if followed by K K is NOT any of f OxTGee. - i 

OxSOal. -Ik 

. OxTGac -fx ) 




30S2 


BB BA if preceded by T LL followed by T or K or Pnc 

BB BA if preceded by Pnc LL followed by T or K 

BB BAifp[recedefoyn Kl followed bv K2 Kl i= NOT in 
moprefix.tbl K2 is NOT any of { 0xS*19c ?!g 

0xG39b^n=:) 

BB if preceded by K LL K is NOT in mcprcux.ibl 

BA if followed by K K is NOT any of { 0:;S19: z'V: 

0\639b=r>; 

BA if followed bvT 




|BB DA if followed bv Pnc 
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Table 4 represents a list of Kanji and Kana (i.e. Kana is equivalent to Katakana 
and Hiragana) repeat characters used by the heuristic rule analysis module 12 to identify
additional character-transitions in the stream of text. None of the repeat characters
identified in Table 4 may start a word.



Table 4. 
Repeat Characters 



々 = unc 3005; sjs 8158; jis 2139; euc a1b9
ゝ = unc 309d; sjs 8154; jis 2135; euc a1b5
ゞ = unc 309e; sjs 8155; jis 2136; euc a1b6
ヽ = unc 30fd; sjs 8152; jis 2133; euc a1b3
ヾ = unc 30fe; sjs 8153; jis 2134; euc a1b4
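
The rule that a repeat character may not start a word can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the rule is applied by marking the transition immediately before each repeat character as a link, and the function name `forced_links` is hypothetical.

```python
from __future__ import annotations

# Repeat (iteration) marks listed in Table 4, given by Unicode code point.
REPEAT_CHARS = {"\u3005", "\u309d", "\u309e", "\u30fd", "\u30fe"}


def forced_links(text: str) -> list[int]:
    """Return each index i where text[i] is a repeat character.

    Because a repeat character may not start a word, the transition between
    text[i-1] and text[i] is assumed to be a link (hypothetical reading).
    """
    return [i for i, ch in enumerate(text) if i > 0 and ch in REPEAT_CHARS]


# The transition before the repeat mark in U+4EBA U+3005 is a link.
print(forced_links("\u4eba\u3005"))
```

A repeat character at index 0 is ignored here because there is no preceding character in the segment to link it to.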
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Table 5 represents the honorific prefixes utilized by the heuristic rule analysis 
module 12 to identify additional character-transitions in the stream of text. If character 
[i] is an honorific prefix, the characters at [i+1] and [i-1] are looked up in the
following tables. If they are found, no break is placed between any of the characters; if only one
character is found, a break is placed between the character that was found and the honorific prefix.



Suffix Table 






S Cb-lc 


iz 


*4ecl 


Ifca 4f3d 




SIb-3 






752S 


© ClOf 




9G75 


5t SSC3 




G599 


^ obs: 




SaGO 






5f0a 


E- C1TG 




Gcd5 


55 53L> 




5T-ln 


£ 'Ahd 




Gbbf 


K SSfd 




-ieOO 






DSd( 






4 fee 



Table 5. 
Honorific Prefix 



m 


6240 




5e7S 


J2 


5ea7 


m 


524d 




SfG2 


ty 


S597 




6d5c 


7Z 


5U9 




5bS5 


us 


S239 


(Z 


Gdl? 




S9a7 




S535 




4eal 


K 


7S34 


ir 


635S 




5927 




S005 


0 


5712 


rr 


5cb3 


H. 


5fl3 


r--, 


5dG9 


tl 


Sbbf 


fo 


66-le 




Gd25 




CGtt 




5f37 


ii 


G756 


</> 


306e 


lT; 


<JfaO 




Sabf 




Sec2 


iH 


795G 


□ 


GOGf 


IU: 


7c3c 








7d7l 


t?. 


39^c 







Prefix Table 

^ 4e-I5 
523G 
-»t 5317 
* 5927 



59c9 
& 59d0 

5d29 
E Ge21 

7d7: 



« Scab 
£ 7 90S-i 
K 9G32 
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10 



A character switch from Kanji to Katakana or from Katakana to Kanji usually
indicates a word break, with some exceptions. Table 6 represents the exceptions to this
rule. For the rule to apply there must be a string of at least 4 katakana
characters. If there are fewer than 4 katakana characters in a string, the heuristic rule
module 12 does not place any breaks.

If the Kanji character immediately preceding a katakana string of 4 or more
characters matches any of those in Table 6, the heuristic rule module 12 does not
place any breaks; otherwise the heuristic rule module 12 places a break between the
kanji and the first katakana character. If the kanji immediately following a katakana
string of 4 or more characters matches any of those in Table 6, the heuristic rule
module 12 does not place any breaks; otherwise the heuristic rule module 12 places a
break between the kanji and the last katakana character.
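
The Kanji/Katakana rule described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the Unicode ranges used to classify characters, the function names, and the tiny exception sets (a two-character subset of the Table 6 prefix and suffix lists) are all assumptions made for the example.

```python
from __future__ import annotations


def is_katakana(ch: str) -> bool:
    # Assumption: the Unicode Katakana block stands in for "katakana".
    return "\u30a0" <= ch <= "\u30ff"


def is_kanji(ch: str) -> bool:
    # Assumption: the CJK Unified Ideographs block stands in for "kanji".
    return "\u4e00" <= ch <= "\u9fff"


def katakana_kanji_breaks(text: str, exception_prefixes: set[str],
                          exception_suffixes: set[str],
                          min_run: int = 4) -> set[int]:
    """Return positions p where a break is placed between text[p-1] and text[p]."""
    breaks: set[int] = set()
    i, n = 0, len(text)
    while i < n:
        if not is_katakana(text[i]):
            i += 1
            continue
        j = i
        while j < n and is_katakana(text[j]):
            j += 1                      # text[i:j] is one katakana run
        if j - i >= min_run:            # shorter runs never trigger the rule
            if i > 0 and is_kanji(text[i - 1]) and text[i - 1] not in exception_prefixes:
                breaks.add(i)           # kanji | first katakana
            if j < n and is_kanji(text[j]) and text[j] not in exception_suffixes:
                breaks.add(j)           # last katakana | kanji
        i = j
    return breaks


# Illustrative subsets only: U+5927 and U+5C0F (prefix), U+5DDD and U+5E02 (suffix).
PREFIXES = {"\u5927", "\u5c0f"}
SUFFIXES = {"\u5ddd", "\u5e02"}
KATAKANA_RUN = "\u30ab\u30e1\u30e9\u30de\u30f3"   # five katakana characters
print(katakana_kanji_breaks("\u65b0" + KATAKANA_RUN, PREFIXES, SUFFIXES))
```

Positions are reported as transition indices so that they can be combined with the character-transition tags used elsewhere in the description.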



Table 6. 

Katakana/Kanji Combination 



Exception Prefix 



"I? 

7C 



-leOd 
4e39 
-Ie7e 
4e9c 
-Iccj 

GeSO 
70ad 
71bl 
T701 
7a74 
7b-J0 



nr 

ci 

ill 
K 

ffi 

Hi 



4ff3 

516d 

5317 

5357 

53cc 

5-l2b 

7dl9 

S131 

677G 

SSab 

S910 

S97f 



EJ 56db 
5927 



Be 



5c0f 
5ec3 
& 6025 
S- C025 
*3 SdS5 
£ 900G 
i3 904c 
± 91dl 
ffl 92S0 
: r 7 92f3 



fl- 
ap 
K 

A 



662S 
G765 
G771 
6b21 
Gb6f 
Gbb5 
975e 
9ad5 
9f3b 



WO 98/08169 




Exception Suffix 





ncDd 


?:t 












r r- 

Jf.l 


it 


531G 






5GGS 






576b 




£ 


57 fa 






5S02 




*3 


5S3-5 


/=; 




5SG9 






5bbG 






5c5o 


ii 


all 


5c7 1 






5cn9 


iii 


a 


5cfG 


i/f. 


Jil 


5ddd 




•ill 


oride 


its 


?n 


5c02 


ii7 
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Table 7. 



0621 


i/f 


GeOS 


5ca7 




Ge7e 


oeao 


K- 


72bG 




m 


752S 


5 for 




75c5 


5fG2 


a 


75c7 


G027 


0*3 


7GS4 


G559 


§ 


76ee 


G591 




77f3 


G5cf 


Ji 


793e 


GGlf 




79dl 


6a5f 


w- 


79d2 


Gcb9 




7bal 


6cdo 


IS 


7cdG 


Gd3e 


iC 

t»v 


7dl9 


6d41 


rr*. 


7d20 


6d77 




6171 


Gdb2 




S266 
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8336 




6349 




S3cc 




666b 


XT 


SSfd 


if 


SaOS 


46 


6a9e 


!?. 


Saac 


77 


Scde 


Pi 


See a 


Ci 


917S 


fv 

»o 


925b 




9271 




92fc 


ri 


9G4d 




9S5c 




9cc5 
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bv the hr ie , 8 TT CharaC,CrS ddimiting a " ^ ° f SCntence ** « 

by the heuristic rule module 12 to identify character-transitions in the stream of text.



Table 8. 
End of Sentence 

The end of sentence is marked by the following character:
maru is the primary EOS marker.

maru followed by a close quotation mark can indicate EOS.

followed bv a particle •"-"eate EOS. except when it is 

<*» «.» at the cmt of . p . ras „;' h ° '"" Mn " " ,ht • b "«' •"»»".■ Thi s „ 
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Accordingly, under step 30 of FIG. 2, the heuristic rule analysis module 1 2 
searches for character-transitions enumerated in the above tables and identifies the
character-transitions as either a break between characters or a link between characters.
A break between characters will also be a break between words. A link between
characters will indicate a potential internal connection between characters within a
character segment. Under Step 32, the heuristic module 12 sets the character-transition
tags associated with characters in the stream of text based upon the analysis performed
under Step 30. Thereafter, at Step 34, the heuristic rule module 12 stores the
character-transition tags determined for each particular character
segment.

At Step 35, the heuristic rule module 12 passes all character-transition tag data to
the statistical rule module 14. The statistical rule module reads the character-transition
tag data and analyzes only those segments with unknown interior character-transitions.
Subsequent processing of the character string after the processing by
heuristic rule module 12 does not include processing of those character-transitions
identified by the heuristic rule analysis module 12.

FIG. 3 illustrates a flow chart for implementing the statistical analysis module
20. The flow chart details a process whereby a window of successive characters in the
stream of text is compared with the entries in a character-transition table. If the
statistical analysis module identifies a match between the window of characters and an
entry in the character-transition table, the statistical analysis module sets those
character-transition tags associated with the characters in the current window. If no matches are
found in the current window of characters, then the characters can be subjected to further
analysis by the word breaker 5. After the current window of characters is processed, the
statistical analysis module 20 slides the window across the input stream of text to review
a different grouping of characters.

As shown in FIG. 3, sliding window analysis begins at Step 40 wherein the
statistical rule module 20 receives a segment of characters having at least one undefined
character-transition therein. After Step 40, control proceeds to Step 60 for initialization.

In Step 60, the character-transition tags before the first character in the segment and after
the last character in the segment are set to a break. This allows the system to approach
the entire character group as an individual word. As processing continues, the default
character-transition tags associated with the characters in the segment may be
overwritten to identify either links or word breaks. Once a character-transition is
changed from its default value to a value identifying either a link or a word break,
the character-transition cannot be changed. After Step 60, control proceeds to Step 62.
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10 



At Steps 62 and 64 the size of the sliding window is initialized. In the preferred
embodiment, the sliding window has a maximum size of four characters and is
decremented by one through each loop. Accordingly, at Step 62 a counter "i" is
initialized to the starting address of the input text, and a counter "j" is set to one less than
the desired window length. At Step 64, the window length is defined to run from "i" to
"i+j".

At Step 66, the characters in the sliding window are compared with entries in the
character-transition table 16 to identify character-transitions associated with the
characters in the segment.

The character-transition table 16 is derived from analysis of a hand-broken
corpus of approximately 200,000 character segments. The table 16 stores reliable
information regarding high-frequency character-transitions. The corpus can include
unigram, digram, trigram and quadgram patterns. In the present invention, the corpus
containing the data for the character-transition table has the following contents:
character sequence; break pattern; frequency of break pattern; connection pattern; and
frequency of connection pattern.

The corpus containing the data for the character-transition table is first run
through the heuristic rule analysis module 12 to remove any segments which can be
analyzed by the existing heuristic rules. This prevents the statistical analysis module 20
from duplicating the efforts of the heuristic analysis module 12.

Initial testing of the character-transition table data using the full hand-broken
corpus showed the following results:

unigrams        2,389     1,632        757
digrams        57,069     9,732     47,337
trigrams      157,116     7,316    149,800
quadgrams     222,443     3,614    218,829    98%



The character-transition table in the present invention preferably comprises data
on unigram [single character-transitions], digram [two character-transitions], trigram
[three character-transitions], and quadgram [four character-transitions], which leads to a
maximum of ten (10) bits of information attached to the quadgram pattern.
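
The ten-bit figure is consistent with one plausible reading of the bitmaps used in the worked example later in this description: an n-character pattern has n+1 character-transitions (one before, one after, and n-1 between characters), and each transition takes two bits. The helper below is a small sketch of that arithmetic; the two-bits-per-transition encoding is an assumption inferred from the example bitmaps, and the function name is hypothetical.

```python
def bitmap_bits(n: int, bits_per_transition: int = 2) -> int:
    """Bits needed for an n-character pattern, assuming n + 1 transitions."""
    return (n + 1) * bits_per_transition


# A quadgram has five transitions, hence ten bits.
print(bitmap_bits(4))
```

Under the same assumption a trigram needs eight bits and a digram six, which matches the lengths of the example bitmaps 01111101 and 010001.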

In the character-transition table an n-gram pattern would be represented by n
linked tables. In the present invention a 4 table mechanism would be needed. In the
present invention, the first table of the character-transition table 14 would be:
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Unicode-character-code start-pos number. 

The first field represents the Unicode value of the character code, start-pos
represents the position at which the rest of the pattern starts in the next (i.e., 2nd) table,
and number is the number of patterns which have this character as the starting character.
Therefore the number of n-gram patterns which start with the character in the first field
and have sub-patterns in the second table is represented by the value of the number
field. The structure of the rest of the tables is similar, except that a unigram pattern
would have a /0 in the second table and the 2nd field would have the bit map of the
pattern. Similarly, the digram patterns would have the /0 in the first field of the third
table and the bit map in the 2nd field of the third table.

With continued reference to FIG. 3, after the comparison process with the
character-transition table at Step 66, control proceeds to Decision Step 68. If no match
with an entry in the character-transition table is found, then logical flow proceeds to Step
70. If a match with an entry in the character-transition table is found, then logical flow
proceeds to Step 78.

At Step 70, the window length variable is reduced by one, thereby reducing the
size of the window of characters. At Step 72, the size of the window is reviewed to
determine whether the window still exists. If the window includes at least one character,
then processing flows back to Step 64 through Action Box 76. The result of Steps 70,
72, and 76 is to incrementally reduce the size of the sliding window in an attempt to
match the characters in the window with smaller entries in the character-transition table.
However, if the window is reduced to the point where it does not include any characters
and a match in the character-transition table has not been found, then processing flows to
Steps 74 and 83.

At Step 74, the counter "i" is incremented by 1 and the counter "j" is set to 3.
This causes the sliding window of successive characters to slide to the right by one, and
resets the size of the sliding window to four characters in length. After Step 74, control
returns to Action Box 76.

At Step 83, those segments in the character string that still have undefined
character-transitions after being processed by the statistical analysis module 20 are
passed to the database analysis module 24 for further processing.

At Step 78, the statistical analysis module 20 stores the new character-transition
tag values for the matched character segments. After Step 78, the statistical analysis
module 20 increments the memory location counter "i" and resets the window length
variable to three, at Step 80. After Step 80, control proceeds to Step 82. At Step 82,
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the first and last character-transitions in the next segment having an undefined character- 
transition is set to a break. After Step 82. control returns to Step 64. 

Example of Heuristic Analysis and Sliding Window Statistical Analysis 

The input stream of text 12NEWBOOKS needs to be analyzed. The heuristic rule
module 12 identifies the number 12 and then passes the segment NEWBOOKS on
to the statistical analysis module 20.

The character-transition table 16, in this example, contains the following
character-transitions:
[NEW] 01111101
[WB] 010001
[BOO] 00111101
[KS] 111101

Step 1:

The initial bitmap representation for NEWBOOKS is set to 000101010101010100
because only the first and last transitions are known as breaks at this point:

00N01E01W01B01O01O01K01S00

First Window:
Step 2:

The counter i is initialized to 1.
Step 3:

The sliding window is defined as [1 2 3 4] = [N E W B].
Step 4:

The sequences for the current window which can be looked up in the
Character-transition Table are:
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[N E W B] [N E W] [N E] [N] 

(1) [N E W B] is looked up in the Character-transition Table and not found.

(2) [N E W] is looked up in the Character-transition Table; it is found with the
bitmap 01111101 which is merged with the existing bitmap:

00N01E01W01B01O01O01K01S00 +
01N11E11W01
00N11E11W01B01O01O01K01S00

Second Window:
Step 5:

i is set to 2.
Step 3:

The sliding window is defined as [2 3 4 5] = [E W B O].
Step 4:

The sequences for the current window which can be looked up in the
Character-transition Table are:

[E W B O] [E W B] [E W] [E]

(1) [E W B O], [E W B], [E W], and [E] are looked up in the Character-transition
Table and not found.

Third Window:
Step 5:

i is set to 3.
Step 3:

The sliding window is defined as [3 4 5 6] = [W B O O].
Step 4:

The sequences for the current window which can be looked up in the
Character-transition Table are:

[W B O O] [W B O] [W B] [W]

(1) [W B O O] and [W B O] are looked up in the Character-transition Table and not
found.

(2) [W B] is found with the bitmap 010001 which is merged with the existing
bitmap:

00N11E11W01B01O01O01K01S00 +
01W00B01
00N11E11W00B01O01O01K01S00
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Fourth Window; 
Step 5:

i is set to 4.
Step 3:

The sliding window is defined as [4 5 6 7] = [B O O K].
Step 4:

The sequences for the current window which can be looked up in the
Character-transition Table are:

[B O O K] [B O O] [B O] [B]

(1) [B O O K] is looked up in the Character-transition Table and not found.

(2) [B O O] is found with the bitmap 00111101 which is merged with the existing
bitmap:

00N11E11W00B01O01O01K01S00 +
00B11O11O01
00N11E11W00B11O11O01K01S00

Fifth Window:
Step 5:

i is set to 5.
Step 3:

The sliding window is defined as [5 6 7 8] = [O O K S].
Step 4:

The sequences for the current window which can be looked up in the
Character-transition Table are:

[O O K S] [O O K] [O O] [O]

(1) [O O K S], [O O K], [O O], and [O] are looked up in the Character-transition
Table and not found.

Sixth Window:
Step 5:

i is set to 6.
Step 3:

The sliding window is defined as [6 7 8] = [O K S].
Step 4:

The sequences for the current window which can be looked up in the
Character-transition Table are:
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[O K S] [O K] [O] 

(1) [O K S], [O K], and [O] are looked up in the Character-transition Table and not
found.

Seventh Window:
Step 5:

i is set to 7.
Step 3:

The sliding window is defined as [7 8] = [K S].
Step 4:

The sequences for the current window which can be looked up in the
Character-transition Table are:

[K S] [K]

(1) [K S] is found with the bitmap 111101 which is merged with the existing bitmap:

00N11E11W00B11O11O01K01S00 +
11K11S01
00N11E11W00B11O11O11K11S00

All transitions in the segment bitmap are now either '00' or '11', i.e., breaks or
connections. The identified segments are thus:

00N11E11W00B11O11O11K11S00 -> |NEW|BOOKS|

The statistical rule analysis module 20 splits the input segment NEWBOOKS into
two segments NEW and BOOKS and identifies all segment-internal transitions. This
means that the two new segments do not have to be analyzed anymore, and the segments
are passed directly to the output module 26.
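
The worked example can be reproduced with a short sketch of the sliding window analysis. It is an illustration under stated assumptions rather than the patented implementation: the two-bit tags (00 = break, 11 = link, 01 = undefined) are read off the example bitmaps, the table is a plain dictionary instead of the linked tables described earlier, positions are 0-based, and the function names are hypothetical.

```python
from __future__ import annotations

BREAK, LINK, UNKNOWN = "00", "11", "01"


def split_bitmap(bits: str) -> list[str]:
    """Split a bitmap string such as '010001' into two-bit transition tags."""
    return [bits[k:k + 2] for k in range(0, len(bits), 2)]


def sliding_window(segment: str, table: dict[str, str],
                   max_window: int = 4) -> list[str]:
    # One tag per transition: before the first character, between each
    # pair of characters, and after the last character.
    tags = [UNKNOWN] * (len(segment) + 1)
    tags[0] = tags[-1] = BREAK                  # segment boundaries are breaks
    for i in range(len(segment)):               # slide one character at a time
        longest = min(max_window, len(segment) - i)
        for length in range(longest, 0, -1):    # shrink the window until a match
            bitmap = table.get(segment[i:i + length])
            if bitmap is None:
                continue
            for k, tag in enumerate(split_bitmap(bitmap)):
                # A transition that is already decided is never overwritten.
                if tags[i + k] == UNKNOWN and tag != UNKNOWN:
                    tags[i + k] = tag
            break
    return tags


def split_at_breaks(segment: str, tags: list[str]) -> list[str]:
    """Cut the segment at every transition tagged as a break."""
    pieces, start = [], 0
    for pos in range(1, len(segment) + 1):
        if tags[pos] == BREAK:
            pieces.append(segment[start:pos])
            start = pos
    return pieces


TABLE = {"NEW": "01111101", "WB": "010001", "BOO": "00111101", "KS": "111101"}
tags = sliding_window("NEWBOOKS", TABLE)
print("".join(tags))                       # 001111001111111100
print(split_at_breaks("NEWBOOKS", tags))   # ['NEW', 'BOOKS']
```

Undefined transitions are treated as non-breaks by `split_at_breaks`, so a segment that still contains `01` tags would need the further database analysis described in the text.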

The sliding window analysis may produce any one of three different results for an
input segment. In the first, all transitions are identified as either a break or a link, and
each entire segment may be verified by the database verification module 22 with a single
lookup. In the second, some transitions are identified. In this case, any identified
transitions will be passed on to the database analysis module 24, which reduces the
overall processing required to identify the word breaks. In the third, no
character-transitions are identified. In this instance, a full analysis by the
database analysis module 24 is required.
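
The three outcomes can be distinguished by counting the undefined interior transitions of a segment. The sketch below assumes the same two-bit tags as the worked example (01 = undefined); the function name and the returned labels are hypothetical and serve only to show the routing decision.

```python
from __future__ import annotations


def classify_segment(tags: list[str]) -> str:
    """Route a segment according to how many interior transitions are undefined."""
    interior = tags[1:-1]                      # boundary tags are always breaks
    unknown = sum(1 for tag in interior if tag == "01")
    if unknown == 0:
        return "verify"                        # single lookup by verification module
    if unknown == len(interior):
        return "full_analysis"                 # nothing identified
    return "partial_analysis"                  # some transitions identified


print(classify_segment(["00", "11", "11", "00"]))
```

A single-character segment has no interior transitions and is therefore routed to verification in this sketch.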
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FIG. 4 illustrates a flow chart for implementing the database verification module 
22. An input is supplied from the statistical analysis module 20, which is an identified
word, i.e., a character segment with a break before the first character, a break after the
last character, and all internal character-transitions defined as a link between
characters. In Step 90, the character segment is compared with lexical entries in the
dictionary 18. In Step 92, if a match occurs, the character segment is sent to the output
module 26. If no match occurs, the character string is passed to the database analysis
module 24 for a full database analysis. It will be appreciated by those skilled in the art
that since the input text segment has been identified as a word, the comparing step, Step
90, will compare the input text segment to one or two entries from the dictionary 18.
Hence, since the number of lexical entries used is so small, any known method of
sorting may be used to locate the lexical entries used to verify the identified word.

FIG. 5 illustrates steps for implementing the database analysis module 24. At
Step 83 the database analysis module receives a segment of characters having
unidentified character-transitions. The database analysis module can receive the
segment from either the statistical analysis module 20 or the database verification
module 22. In Step 100, the character segment is compared with lexical entries in the
dictionary 18 to identify a match for the character segment in the dictionary. Once a
match for the character segment is identified, the unidentified character-transitions in the
word segment are identified and the completed segment is forwarded to output module
26 with all word breaks being identified.

FIG. 6 illustrates a flow chart detailing the comparison Step 100 of FIG. 5. 
At Step 120, the character segments with unknown character-transitions are input to the 
Database Analysis Module 24. After Step 120, control proceeds to Step 122. 

At Step 122, a dictionary-look-up string is created based upon the input data 
received at Step 120. Afterwards, at Step 124, all left sub strings of the dictionary look 
up string are chosen for analysis. After Step 124, control proceeds to Step 126. 

At Step 126, all words found in the dictionary 18 that possibly match the
dictionary look up string are stored in a candidate word list. Then, at Step 128, the
candidate word list is sorted in order of priority. The database analysis module 24 can
sort the candidate word list based on various criteria, such as the starting position in the
candidate word list or the length of the entries stored in the candidate word list. After
Step 128, flow proceeds to Step 130.

At Step 130, the first entry in the sorted candidate word list is compared to the
dictionary look up string. At Decision Box 132, the database analyzer 24 determines
whether there is a match. If a match is found, then control proceeds to Step 138 where 
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the final word boundaries are set and all character-transition tags are set to either a break 
or a link, as determined by the matched entry in the candidate word list. The next 
dictionary look-up-string is then retrieved and the process repeated. 

If a match is not found at Decision Box 132, then control proceeds to Decision
Box 134. At Decision Box 134, the processor branches control to Step 136 if there are
more words in the candidate word list, and the processor branches to Step 140 if there
are no more words in the candidate word list. At Step 136, the next word from the
candidate word list is retrieved for comparison with the dictionary lookup string. At
Step 140 the next substring is retrieved from the dictionary look up string for the
comparison at Step 130.
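
The dictionary comparison of FIG. 6 can be approximated by the following sketch. It is a simplified illustration and not the patented procedure: it assumes the candidate word list is prioritized by length (longest first), it falls back to a single character when no left substring is in the dictionary, and the function names are hypothetical.

```python
from __future__ import annotations


def left_substrings(lookup: str) -> list[str]:
    """All prefixes of the lookup string, longest first."""
    return [lookup[:k] for k in range(len(lookup), 0, -1)]


def dictionary_breaks(segment: str, dictionary: set[str]) -> list[str]:
    """Split a segment into words by repeated left-substring lookup."""
    words, pos = [], 0
    while pos < len(segment):
        lookup = segment[pos:]                       # dictionary look up string
        candidates = [s for s in left_substrings(lookup) if s in dictionary]
        candidates.sort(key=len, reverse=True)       # assumed priority: length
        # Assumed fallback: emit one character if nothing matches.
        word = candidates[0] if candidates else lookup[0]
        words.append(word)
        pos += len(word)                             # next look up string
    return words


print(dictionary_breaks("NEWBOOKS", {"NEW", "BOOK", "BOOKS"}))   # ['NEW', 'BOOKS']
```

Sorting by length is only one of the criteria the description mentions; a different priority function can be substituted in the `sort` call without changing the rest of the loop.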

While the invention has been shown and described having reference to specific 
preferred embodiments, those skilled in the art will understand that variations in form 
and detail may be made without departing from the spirit and scope of the invention. 
Having described the invention, what is claimed as new and secured by letters patent is: 
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CLAIMS 

1. A method for locating unidentified breaks between words in an input character
string formed of a plurality of characters, the method comprising the successive steps of
storing said input character string in a computer memory element,

identifying at least one morpheme in a first segment of said stored character
string,

reducing the number of unidentified word breaks in said stored character string
by locating a first word break in said first segment of said stored character string based
upon said at least one morpheme, said first word break dividing said first segment into a
first sub-segment and a second sub-segment, and

locating further unidentified word breaks in said first and second sub-segments
by comparing said first and second sub-segments to entries in a dictionary.

2. The method of claim 1, wherein said reducing step further includes verifying said
first word break by matching a word preceding said first word break with a first entry in
said dictionary and by matching a word following said first word break with a second
entry in said dictionary.

3. The method of claim 1, wherein said identifying step includes the steps of
locating word breaks and character-transitions by applying a set of rules to said stored
character string to identify said at least one morpheme.

4. The method of claim 3, wherein said applying step further comprises
forming a window of successive characters from said stored character string,

comparing said window of successive characters to entries in a character-transition
table, and

identifying said window of successive characters that matches an entry in the
character-transition table as said at least one morpheme.

5. The method of claim 4, further comprising the step of decreasing the size of said
window of characters if no entries in said character-transition table match said window
of successive characters.
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6. The method of claim 4. further comprising the step of sliding the window of 
successive characters across said stored character string if no entries in said character- 
transition table match said window of successive characters. 

7. The method of claim 4, including the step of forming the character-transition
table by generating a minimum spanning set of character strings necessary to identify
character-transitions.

8. The method of claim 7, wherein the spanning set of character strings includes a
plurality of character strings having different lengths.

9. The method of claim 1, wherein said reducing step includes the steps of
detecting a first character-transition in said stored character string based upon
said at least one morpheme, and
locating said first word break as a function of said at least one morpheme and
said first character-transition.

10. The method of claim 9, wherein said locating step includes the step of
concatenating a first character and a second character together when said first
character-transition indicates the existence of a connection between characters.

11. The method of claim 9, wherein said locating step further comprises the step of
identifying a break between a first character and a second character when said first
character-transition indicates the existence of a break between characters.

12. The method of claim 1, wherein said locating step further comprises the steps of
creating a lookup string from characters within said first sub-segment,
identifying a dictionary entry that substantially matches said lookup string, and
marking a second word break between the matched lookup string and a character
that precedes the lookup string and marking a third word break between the matched
lookup string and a character that follows the lookup string.

13. A method according to claim 12, further comprising the steps of creating a
candidate word list from a dictionary as a function of said lookup string, and wherein
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said identifying step includes comparing an entry in said candidate word list with said 
lookup string. 

14. The method of claim 12, further comprising the step of 
5 validating that the matched lookup string is a word. 

15. The method of claim 14, wherein the step of validating the matched lookup 
string includes 

selecting an identified word, from the matched lookup string, and 
1 0 comparing said matched lookup string to a dictionary for determining the 

validity of the identified word . 

16. The method of claim 1, further comprising the step, prior to said identifying step,
of applying a set of heuristic rules to said stored character string to identify a
character-transition in said first segment of said stored character string, said
identification of a character-transition reducing the number of possible character
combinations forming words in said stored character string.

17. The method of claim 16 further comprising the step of identifying a
concatenation between characters in said first segment as a function of said heuristic
rules.

18. The method of claim 16 further comprising the step of selecting said heuristic
rules for identifying a break between characters in said first segment.

19. The method of claim 16, wherein said step of applying the set of heuristic rules 
further comprises 

locating a number in said stored character string, and 
identifying a character-transition that precedes and a character-transition that
follows said located number.

20. The method of claim 16, wherein said step of applying the set of heuristic rules 
further comprises 

locating identifying punctuation in said stored character string, and 
identifying a character-transition that precedes and a character-transition that
follows said located punctuation.

21. The method of claim 16, wherein said step of applying the set of heuristic rules
further comprises 

locating identifying Roman letters in said stored character string, and 
identifying a character-transition that precedes and a character-transition that 
follows said located Roman letters. 

22. The method of claim 16, wherein said step of applying the set of heuristic rules
further comprises 

locating identifying classifiers in said stored character string, and
identifying a character-transition that precedes and a character-transition that 
follows said located classifiers. 

23. The method of claim 16, wherein said step of applying the set of heuristic rules 
further comprises 

locating identifying particles in said stored character string, and 
identifying a character-transition that precedes and a character-transition that 
follows said located particles. 

24. The method of claim 16, wherein said step of applying the set of heuristic rules
further comprises 

locating identifying honorific prefixes in said stored character string, and 
identifying a character-transition that precedes and a character-transition that 
follows said located honorific prefixes. 

25. The method of claim 16, wherein said step of applying the set of heuristic rules
further comprises 

locating an identifying emperor year in said stored character string, and 
identifying a character-transition that precedes and a character-transition that 
follows said located emperor year. 

26. The method of claim 16, wherein said step of applying the set of heuristic rules 
further comprises 

locating identifying Kanji-Katakana character-transitions in said stored character 
string, and 
identifying a character-transition that occurs at said located Kanji-Katakana
character-transition. 
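
Several of the heuristic rules in claims 19 to 21 and 26 depend only on the class of each character, so they can be sketched with a small character classifier. The Python below is an illustrative sketch, not the claimed rule set: the Unicode ranges, the punctuation list, and the names `char_class` and `heuristic_breaks` are assumptions, and every detected transition is simply treated as a break. Classifiers, particles, honorific prefixes, and emperor years (claims 22 to 25) would need lexical lists and are omitted.

```python
def char_class(ch):
    """Rough character classes by Unicode range (an assumption of this sketch)."""
    if ch in "、。「」！？,.!?":
        return "punctuation"
    if ch.isdigit():
        return "number"
    if ch.isascii() and ch.isalpha():
        return "roman"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    return "other"


# Classes whose runs get a transition before and after them (claims 19 to 21).
BREAKING_CLASSES = {"number", "punctuation", "roman"}


def heuristic_breaks(text):
    """Return positions p where a break is placed between text[p - 1] and text[p]."""
    breaks = []
    for p in range(1, len(text)):
        left, right = char_class(text[p - 1]), char_class(text[p])
        if left == right:
            continue  # same class: leave the transition undefined
        if left in BREAKING_CLASSES or right in BREAKING_CLASSES:
            breaks.append(p)  # edge of a number, punctuation, or Roman-letter run
        elif {left, right} == {"kanji", "katakana"}:
            breaks.append(p)  # Kanji-Katakana transition (claim 26)
    return breaks
```

Transitions that no rule resolves stay undefined and remain for the later statistical and dictionary stages.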

27. A programmable computer apparatus for locating unidentified breaks between
words in an input character string, comprising

A) a computer memory element for storing the input character string, 

B) first memory means for storing a character-transition table including 
character segments of morphemes, 

C) second memory means for storing a dictionary, said dictionary including
lexical entries,

D) a statistical analysis module operably coupled with said first memory
means storing said character-transition table for reducing the number of unidentified word
breaks by locating a first word break in a first segment of said input character string as a
function of at least one statistical morpheme in said first segment, said first word break
dividing said first segment into a first sub-segment and a second sub-segment, and

E) a database analysis module operably coupled with said dictionary for 
locating substantially all of the remaining unidentified word breaks in said first and 
second sub-segments by comparing said first and second sub-segments with entries in 
said dictionary. 

28. The apparatus of claim 27, wherein said statistical analysis module further
comprises 

first processing means for identifying said at least one statistical morpheme in
said first segment by comparing said first segment with entries in said
character-transition table and for detecting a character-transition associated
with said at least one statistical morpheme, and

second processing means for locating a first word break in said first 
segment as a function of said at least one statistical morpheme and said character- 
transition. 

29. The apparatus of claim 28, wherein said first processing means further comprises 
a windowing module for forming a window of successive characters from said first 
segment such that said window of characters can be compared with entries in said 
character-transition table.

30. The apparatus of claim 29, wherein said first processor module includes means
for sliding said window of successive characters along said first segment of said input 
character string. 

31. The apparatus of claim 29, further comprising means for changing the size of
said window of characters. 
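
The windowing behavior recited in claims 29 to 31 (forming a window, sliding it along the segment, and changing its size) might look like the following Python sketch. The table layout, in which each character string maps to one tag per internal boundary, the tag values, and the names `tag_transitions` and `max_window` are hypothetical; the claims do not fix a data format.

```python
def tag_transitions(segment, transition_table, max_window=3):
    """Tag boundaries in `segment` using a table of character strings.

    `transition_table` maps a character string to a list of tags, one per
    boundary inside that string (len(key) - 1 tags). The table layout and
    the tag values are assumptions of this sketch.
    Returns one entry per boundary in `segment`; None means undefined.
    """
    tags = [None] * max(len(segment) - 1, 0)
    # Try the largest window first, then shrink it (changing the window size).
    for size in range(max_window, 1, -1):
        # Slide the window along the segment one character at a time.
        for start in range(0, len(segment) - size + 1):
            window = segment[start:start + size]
            entry = transition_table.get(window)
            if entry is None:
                continue
            for offset, tag in enumerate(entry):
                if tags[start + offset] is None:  # keep tags set by larger windows
                    tags[start + offset] = tag
    return tags
```

Giving precedence to larger windows is one possible design choice; a different priority scheme would be equally consistent with the claim language.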

32. The apparatus of claim 28, further comprising means for associating a character- 
transition tag with characters in said input string. 

33. The apparatus of claim 32, wherein said means for associating a character- 
transition tag includes means for indicating a concatenation between successive 
characters. 

34. The apparatus of claim 32, wherein said character-transition tag indicates a break
between successive characters. 

35. The apparatus of claim 27, wherein said database analysis module further 
comprises: 

third processing means for identifying a match between said first sub-segment
and an entry in said dictionary, and

fourth processing means for locating a second word break in said first sub- 
segment as a function of said matched entry. 

36. The apparatus of claim 27, further comprising:

a heuristic rule table including a set of heuristic rules, 
a heuristic rule module operably coupled with said heuristic rule table for 
identifying a character-transition in said first segment of said stored character string, 
such that the number of possible character combinations forming words in said stored
character string is reduced.

37. The apparatus of claim 27, further comprising a word verification module, 
operably coupled with said dictionary, for verifying matches between an identified word 
in said input character string and dictionary entries.

38. The apparatus of claim 27, wherein said character-transition table includes
character strings of morphemes that form a minimum spanning set necessary to identify 
character-transitions in said input character string. 

39. The apparatus of claim 38, wherein the spanning set includes a plurality of 
character strings having different lengths. 

40. A machine readable data storage medium, comprising 

means for reducing the number of unidentified word breaks in a character string 
by locating a first word break in a first segment of said character string as a function of 
at least one statistical morpheme in said first segment, said first word break dividing said 
first segment into a first sub-segment and a second sub-segment, and 

means for locating substantially all of the remaining unidentified word breaks in 
said first and second sub-segments by comparing said first and second sub-segments 
with entries in a dictionary of lexical entries. 

41. The machine readable data storage medium of claim 40, further comprising a
character-transition table including character segments of morphemes. 